Tokenize (Text Processing)

Synopsis

Tokenizes a document.

Description

This operator splits the text of a document into a sequence of tokens. There are several options for specifying the split points.

The default setting uses every non-letter character as a separator. As a result, each word in the text is represented by a single token, which is usually the most appropriate choice before building the word vector.

If you are going to build windows of tokens or similar structures, you will probably want to split into complete sentences instead. This is possible by setting the split mode to specified characters and entering all splitting characters.

The third option lets you define a regular expression and is the most flexible choice for special cases.
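
For illustration, here is a minimal Python sketch of the three splitting behaviours described above. It is not the operator's actual implementation; the example text, the letter range A-Z/a-z, and the whitespace pattern in the regular expression mode are assumptions made for this sketch.

    import re

    text = "RapidMiner splits text into tokens. Choose the mode that fits your use case."

    # 'non letters' mode (default): every non-letter character acts as a
    # separator, so each word becomes a single token.
    word_tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]

    # 'specify characters' mode: split on a user-defined set of characters,
    # e.g. '.' to obtain whole sentences as tokens.
    sentence_tokens = [t.strip() for t in text.split(".") if t.strip()]

    # 'regular expression' mode: the pattern itself defines the split points,
    # here splitting on any run of whitespace.
    regex_tokens = [t for t in re.split(r"\s+", text) if t]

    print(word_tokens)
    print(sentence_tokens)
    print(regex_tokens)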

Input

  • document

    The document port.

Output

  • document

    The document port.

Parameters

  • mode: Selects the tokenization mode. Depending on the mode, the split points are chosen differently.
  • characters: The incoming document is split into tokens on each of these characters. For example, enter '.' to split into sentences.
  • expression: The regular expression that defines the split points.
  • language: The language used for the part-of-speech (POS) tagger.
  • max_token_length: The maximal length of the resulting tokens.
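
As a rough illustration of how the characters and max_token_length parameters could interact, consider the following Python sketch. The parameter values, the example text, and the treatment of max_token_length as a filter on overly long tokens are assumptions for this sketch only, not a description of the operator's internal behaviour.

    import re

    text = "Short one. This is a considerably longer second sentence."

    # characters = '.'  -> the document is split into sentence-like tokens
    split_characters = "."
    tokens = [t.strip()
              for t in re.split("[" + re.escape(split_characters) + "]+", text)
              if t.strip()]

    # max_token_length = 30 -> keep only tokens up to that length
    # (treating the parameter as a filter is an assumption for this sketch)
    max_token_length = 30
    tokens = [t for t in tokens if len(t) <= max_token_length]

    print(tokens)  # ['Short one']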